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Escherichia coli - one of the most characterized bacteria and a major public health concern - 
remains invisible across the temporal landscape. Here, we present the meticulous recon- 
struction of the first ancient £. coli genome from a 16" century gallstone from an Italian 
mummy with chronic cholecystitis. We isolated ancient DNA and reconstructed the ancient 
E. coli genome. It consisted of one chromosome of 4446 genes and two putative plasmids 
with 52 genes. The E. coli strain belonged to the phylogroup A and an exceptionally rare 
sequence type 4995. The type VI secretion system component genes appears to be hor- 
izontally acquired from Klebsiella aerogenes, however we could not identify any pathovar 
specific genes nor any acquired antibiotic resistances. A sepsis mouse assay showed that a 
closely related contemporary E. coli strain was avirulent. Our reconstruction of this ancient E. 
coli helps paint a more complete picture of the burden of opportunistic infections of the past. 
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victims has almost exclusively focused on historically sig- 

nificant mortality events, such as the Black Death, revealing 
the evolutionary history of canonical pathogens such as Yersinia 
pestis, Mycobacterium tuberculosis!, and Variola virus. In contrast, 
much human morbidity and mortality is the result of opportunistic 
infections, ones that often remain invisible in the past. Opportu- 
nistic pathogens, those without historical records—such as 
Escherichia coli, Pseudomonas aeruginosa, and Staphylococcus 
aureus—have been understudied relative to their contemporary 
burdens on people today”. They are defined by their ability to 
infect during periods of stress, imbalance, or disturbance while 
being otherwise commensal or saprophytic®. Opportunistic infec- 
tions likely played an important role in human mortality in our 
shared past and thus have had a broader impact on human health 
than can be or has been measured. 

A confounding factor in the identification of historically 
understudied pathogens is that the resultant infections were likely 
primarily opportunistic. That is, they colonized their host envir- 
onments asymptomatically, leaving no identifiable pathologies and 
as such are not easily identified in human remains*. Ancient DNA 
studies typically focus on obligatory pathogens such as M. leprae, 
Salmonella enterica, and Y. pestis that are easily correlated with 
pathologically distinct, or historically relevant mortality events and 
as such are easily distinguished as foreign in ancient human 
metagenomic DNA read data°-’. Opportunistic infections have the 
added burden of authentication due to their ubiquity as environ- 
mental contaminants and of modern commensal strains. Historical 
identification of these pathogens would allow for a careful assess- 
ment of their evolutionary history and the commensal-pathogen 
continuum defined by gene content gain and loss as strains 
modulate toward or away from host virulence. 

E. coli is one such pathogen. It is a common commensal bac- 
teria found in vertebrate gut microbiome® that can also become 
an opportunistic pathogen under specific conditions’. E. coli has 
such a large impact on our health care systems that it is the 
subject of several vaccine development efforts to mitigate the 
effects of the most pathogenic strains!°. Having been responsible 
for several food poisoning outbreaks and becoming a leading 
pathogen for deaths caused by antimicrobial resistance, E. coli is 
thus a key bacterium used in public health surveillance!!. 

Global sampling of E. coli strains produces a tree with several 
unique phylogenetic groups and interspersed pathovars!*. While 
the phylogenetic relationships between strains are constructed 
based on genetic similarity, pathovars are defined by the virulence 
traits of their members. In many cases, members of the same 
pathovar do cluster together in the same clade, however, as 
virulence genes can be acquired horizontally, some pathovars— 
like enteroaggregative E. coli—are distributed across multiple 
phylogenetic groups!”. This striking diversity and the evolutionary 
transitory states among E. coli strains highlight their genomic 
plasticity and evolutionary versatility along this aforementioned 
continuum. Ancient E. coli gnomes would provide useful insights 
into the forces that influence the emergence of commensalism and 
pathogenicity in bacteria. Here, we describe the reconstruction of a 
16th-century E. coli genome characterized from the gallstone of an 
Italian noble—Giovani d’Avalos (1538-1586)—highlighting a 
genome with commensal characteristics. 


T he recovery of ancient pathogen DNA (aDNA) from human 


Results 

The mummified remains. In 1983, the mummified remains of 
several Italian nobles were recovered from the Abbey of Saint 
Domenico Maggiore in Naples, Italy (Supplemental Fig. 1). A 
careful paleopathological and histological survey of one of the 
individuals—Giovani d’Avalos (NASD1), a Neapolitan noble who 


died in 1586 at the age of 48 (further information can be found in 
the Supplementary Notes section of the Supplemental Materials) 
—revealed thickened gallbladder walls, Rokitanski-Aschoff sinu- 
ses, and several intact gallstones!>. These features suggested that 
NASD1 may have suffered from chronic cholecystitis!4. While 
not the only cause of cholecystitis, chronic bacterial infections 
from E. coli, Bacteroides spp., and Clostridium spp. can lead to the 
formation of gallstones!°. These infections are typically indicated 
by a brown pigment, as can be seen in the gallstone from 
NASD1!5 (Fig. 1a, b). 


Authenticating the ancient DNA. Metagenomic profiles gener- 
ated from the DNA extracts using Kraken 2!° provided pre- 
liminary evidence for the substantial and increasing presence of 
Enterobacteriales in digest rounds 2 (outer layer), 3-4 (middle 
layer), and 5-6 (inner layer) from the gallstone while indicating 
its absence in the tissue samples of NASD1 (Supplemental Fig. 2). 
Enterobacteriales reads are also present in digest round 1 however 
they are comparable with those in the reagent blanks (Supple- 
mental Fig. 2), thus likely to be a mixture of endogenous and 
contaminant DNA such as Bradyrhizobium. Species-level indi- 
cation suggested that E. coli made up the largest proportion of 
reads identified as Enterobacteriales. Fortunately, E. coli-specific 
reads are virtually absent (<0.002%) from the blanks and present 
in the second through sixth digests. In comparison, other taxa 
such as Alcaligenes and Bradyrhizobium were present in both the 
experimental samples and the blanks and are likely to be con- 
taminants. Human assigned reads were present in all digests but 
were also detected in the blanks. Klebsiella aerogenes was also 
detected, however only in digest rounds 3-4. To assess the 
authenticity of the Human, E. coli, and K. aerogenes reads we 
examined the deamination and depurination kinetics (Fig. 1c, d 
and Supplemental Fig. 3). 


Reconstructing the ancient E. coli genome and putative plas- 
mids. The reconstructed ancient E. coli genome had a mean read 
depth of 28.46[28.31, 28.62], mean heterozygosity of 
7.66 x 10~4[6.66 x 10-4, 8.66 x 10-4], 4446 genes when mapped 
to the E. coli pan-genome dataset and a total estimated size of 
4050, 429bp when the detected genes are concatenated. Sig- 
nificant differences (two-sided t-test, P= 7.951 x 1095) were 
detected when comparing read depths of the core (3122 genes, 
28.73[28.61, 28.86] x) and accessory (1324genes, 27.82[27.39, 
28.26] x) genomes (Fig. 2a and Table 1); the same was true for 
their mean heterozygosity with the core at 2.74x10~4 
[2.34 x 10-4, 3.15 x 10-4] and accessory genome with 1.92 x 10~3 
[1.61x 10-3, 2.24x10~3](Supplemental Fig. 4). Eighty-three 
genes were identified as having a substantially (at least two 
standard deviations) deeper gene coverage. These results were 
then compared to the closely related A phylogroup E. coli gen- 
omes of FSIS11816402, an sequence type (ST) 4995 strain, and 
K-12 MG1655, an ST10 strain (Fig. 2b and Supplemental Fig. 5). 
Both genome comparisons returned a lower mean depth with a 
larger variance of coverage. This is especially true for 
FSIS11816402 as 25.24% of its contigs had a mean coverage of 
less than 1 x (Supplemental Fig. 6). 

We identified two potential plasmids within the assembled 
scaffolds with MOB-suite!”: E. coli MDR_56 plasmid unnamed 
5 (CP019906.1) and S. flexneri 1a strain 0228 plasmids 
(CP012732.1). Subsequent mappings of the E. coli reads back 
to these reference plasmids confirmed their presence in the 
ancient strain and authenticated their origin (Fig. 2c and 
Supplemental Figs. 5, 7). One plasmid, CP019906.1 had a 
read coverage similar to that of K-12 (20.95[20.80, 21.10] x) 
and 34 genes, whereas, CP012732.1 had lower coverage 
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Fig. 1 Characterization of both the physical sample and the authentication of the sequenced data. a Liver and gallbladder of Giovani d'Avalos. The 
gallstones can be seen in the red rectangle. Note its dark brown coloring. Scale bar represents 1cm b Gallbladder with a thickened wall (a) and Rokitansky- 
Aschoff sinuses (b) (Hematoxylin-eosin, 3X and 250X). The scale bar for (a) represents 2 cm and 100 pm for (b). e Damage plots of the 5’ and 3’ ends of 
mapped reads for E.coli. Colors refer to the digest, not damage type. Mapdamage 2.052 was used to calculate the damage rates. d Fragment length 
distribution of deduplicated mapped reads from E. coli. A log, scale is used to emphasize the differences between the digests. A minimum length of 30 bp 


was required for a read to be kept. 


(16.46[16.28, 16.64] x) and only 19 genes. These plasmids contain 
regions with no read coverage—including a ~6.6 Kbp region in 
CP012732.1—but we could not confirm if the genes were part 
of the chromosome. 

A subset of the sequenced reads were also classified as 
belonging to K. aerogenes by Kraken 2!°. To confirm or refute 
this result, unmapped reads were aligned against the K. aerogenes 
(NC_015663.1) reference genome. The overall read depth 
(0.38[0.37, 0.38] x) and genome coverage was low (3.43%), 
suggesting that only a subset of genes from K. aeorgenes was 
present. However, a 38 Kbp section of the genome (Fig. 2d) had a 
read mapping depth similar to the ancient E. coli results 
(20.36[20.23, 20.50] x). 


Phylogenetic reconstruction. In order to place the ancient E. coli 
within the global phylogeny and help identify the strain, we 
produced a core SNP alignment of 451 E. coli and four Shigella 
genomes representing the current breadth of E. coli diversity. 
After removing redundancy, the resulting phylogeny contained 
107 genomes. This pruned sample set was then used to create a 
new alignment of 5007 core SNPs which returned an overall 
topology resembling previously published results and placed the 
ancient strain into a strongly supported phylogeny within phy- 
logroup A (Fig. 3a). 

To more carefully refine the position of the ancient strain 
within group A, we generated a reduced ML tree using 94 
genomes (Fig. 3b) consisting of 291 SNPs. The ancient genome 


clustered closely with the ST4995 strains with some statistical 
support (bootstrap support of 65%), signifying that it is likely part 
of the same sequence type (Table 2). An even further refined ML 
tree was then generated using the 22 available ST4995 genomes 
from Enterobase!® (Fig. 3c). The core SNP alignment 
contained 16,026 nucleotides, providing a much greater resolu- 
tion than the global and AO subgroup alignments. The ancient 
strain clearly clusters within the ST4995 genomes. 

We scanned for temporal signals across all phylogenies using 
TEMPest!? and found one when the phylogeny was restricted to 
the ST4995 genomes (Supplemental Fig. 8), a result confirmed a 
date-randomization test in LSD (P = 0.003). The ST4995 phylogeny 
(Fig. 3c) suggests an 11th century (1027[787, 1220]CE) tMCRA for 
ST4995 strains, nearly 500 years before the diversification of the 
modern strains. The estimated evolutionary rate across the 
phylogeny was 2.555 x 10~®[1.567 x 10~®, 3.992 x 10~®] subs/site/ 
year, similar to previously published results?°. 

Multilocus sequence and typing of our ancient E. coli genome 
confirmed its placement in the ST4995—a rare sequence type with 
only 22 strains in the Enterobase!® database which encom- 
passes 182,476 entries!®. It exhibited an Onovell5:H? serotype, 
similar to its sister taxa (see the Supplementary Discussion section 
in the Supplementary Materials for full results). A different fimH 
variant — fimH86 - was present in our E. coli strain. 


Global gene content. The majority (95%) of the E. coli strains 
contained 3190 core genes, 3122 of which were detected in our 
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Fig. 2 Read coverage metrics for the E. coli pan-genome in comparison to other genomes. a Distribution of mean gene coverages with a CV <1 for the 
ancient genome. The dashed line indicates the detection threshold at 10x. The black rectangle on the right indicates the area where genes with a high copy 
number (as defined by x + 2 * s). b Coverage plot for FSIS11816402. A window of 1% was used for illustration purposes. ¢ Coverage plot for CP019906.1. 
A window of 0.1% was used for illustration purposes. The first track indicates the coverage with the red line illustrating the overall mean. The second track 
indicates the number of SNPs over the same window while the third is the GC content. d Gene coverage of the T6SS in K. aerogenes using a 100 bp window. 


Gene names are included when available. 


Table 1 Read coverage metrics for the E. coli pan-genome, 
FSIS11816402, K-12 MG1655, CP019906.1, and 
CP012732.1 references. 


% Positions with 
coverage >1* 


97.93% 
77 84% 
78.30% 
80.62% 
64.01% 


Sequence Mean coverage 


Pan-Genome 
FSIS11816402 
K-12 MG1655 
CP019906.1 
CP012732.1 


28.46[28.31, 28.62] x 
18.00[16.17, 19.83] x 
20.94[20.93, 20.95] x 
20.95[20.80, 21.10] x 
16.46[16.28, 16.64] x 


FSIS11816402 used the mean of means greater than 1x as it was an assembly. The 95% 
confidence interval is listed in the square brackets. 


ancient genome (Supplemental Fig. 9). Enrichment analysis of the 
missing core genes revealed a significant lack of flagellar assembly 
genes (P = 0.0445), however, the protein-protein network con- 
tained the expected number of interactions (P= 0.11). An ana- 
lysis of regions with mapped ancient reads in CP019906.1 
revealed the presence of genes involved in biofilm formation 
(NP_311382.1) and a multidrug efflux pump (acrD). For 
CP012732.1, in contrast, reads mapped to pseudogenes are 
likely related to osmoprotection and transcription regulations. 
A presence/absence (P/A) analysis of the accessory genome 
recapitulates the results from the ML phylogeny results (Fig. 4a) 


and previously published results?!. The accessory PCoA indicates 
that the genome is a member of phylogroup A and an ancestral 
member of ST4995. The latter is confirmed by a P/A analysis of 
an ST4995-only pan-genome (Supplemental Fig. 10). 

A total of 91 virulence factors were identified in the ancient 
genome, 37 of which were not found in E. coli K-12 MG1655 (see 
Table 3 for gene families with more than one hit). Type VI 
secretion system (T6SS) components consisted of the majority of 
these genes and contained high copy numbers. The T6SS— 
formed by the tss gene family as well as hcp, vasK, and vgrG— 
mediates antagonistic interactions between competing bacteria. In 
addition to its role in bacterial killing, T6SS is involved in 
interbacterial signaling, biofilm formation, and phage defense. 
Parts of a type III secretion system were also detected at much 
lower levels. 

We also observed incomplete virulence gene complements 
involved in several mechanisms. Several genes belonging to the 
ecp and fim fimbriae families were found in the ancient genome. 
While they are involved in virulence??>74, they can also be 
present in commensal strains. Fimbriae known to be associated 
with virulence such as cfaB and csg were also identified? yet they 
do not provide enough information to pinpoint the specific 
pathovar of the ancient strain?*>4. The key virulence factors for 
the E. coli pathovars such as eae and stx for Shiga toxin- 
producing E. coli (STEC), heat-labile and heat-stable genes for 
Enterotoxigenic E. coli (ETEC), and the type III secretion system 
in EPEC and EHEC were absent?. 
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Fig. 3 Maximum likelihood SNP phylogeny of E. coli. a The global phylogeny with bootstrap values and phylogenetic groups with E. coli EC42405 as the 
out-group. b Phylogeny of the reduced subgroup AO as defined by the Clermont genotype (+ - - -)®". Tip points represent the sequence type of the strain. 
IAI1 was the out-group of the phylogeny. ¢ Phylogeny of the ST4995 strains. Red rectangles represent the 95% confidence interval for the topology; labels 
indicate the median date of divergence. The evolutionary rate for the phylogeny was 2.555 x 10-®[1.567 x 10-®, 3.992 x 10-©] subs/site/year. 
CFSANO51544 was the out-group. The brown arrows indicate the position of the ancient genome. 


Table 2 SNP distances of seven strains from the ST4995 phylogeny. 


Genomes CFSANO51544_ ‘Ancient Ecoli FSIS11816402 
CFSANO51544 ) 77) 869 

Ancient Ecoli 771 0 419 
FSIS11816402 869 419 0 

ATCC 11229 920 503 588 
ESC_SA9272AA 9771 554 639 
ESC_DB2295AA _ 1012 596 681 
ESC_CA2237AA 1024 607 692 


CFSAN051544 was the out-group of the phylogeny. 


ATCC 11229 ESC_SA9272AA _ESC_DB2295AA _ ESC_CA2237AA 
920 971 1012 1024 

503 554 596 607 

588 639 681 692 

fe) 215 219 230 

215 ) 308 319 

219 308 0 323 

230 319 323 ) 


In contrast astA, a key virulence factor in Enteroaggregative E. 
coli (EAEC)*°, was detected in the ancient genome. In 
combination with evidence that T6SS is commonly found in 
EAEC strains”, this may indicate that the ancient strain was been 
a member of this pathovar. The remaining EAEC virulence 
factors in the pan-genome—aggA, aggR, and aap—however, 
were not detected?°. The ancient genome also contains a 
complete enterobactin system which is responsible for iron 
uptake”>. The enterobactin system and astA gene were also found 
in E. coli K-12 MG1655, once again indicating that they do not 
confer enough virulence on their own. 

To evaluate the extraintestinal virulence potential of ST4995 
we tested a close relative (507 core SNPs separating the 
two strains) of the ancient genome, the reference strain 


ATCC11229 (see Fig. 3b and Supplemental Fig. 11 for 
similarities to the modern strain) in a well-calibrated mouse 
sepsis assay where ten mice were inoculated with the strain and 
their death recorded?®. None of the mice were killed by 
ATCC11229, whereas all of the mice infected by the positive 
control B2 phylogroup CFTO073 strain were killed. The 
phenotype of the strain which killed the mice is linked to the 
presence of specific extraintestinal virulence genes with a major 
role in the iron capture system encoding genes as those bared 
by the high-pathogenicity island (HPI)*”?°. This result is in 
agreement with the absence of typical extraintestinal virulence 
genes present in the ATCC11229 and ancient strain. 

We searched our ancient genome for the presence of 
antimicrobial resistance (AMR) genes using the Resistance 
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Fig. 4 Principal coordinate analyses (PCoA) of gene presence/absence data. The accessory genome (a, ¢) and identified virulence genes (b, d) were 
clustered using a binary distance. Phylogroups were identified using Clermont Typing?? while pathovars were determined using available metadata 
from the Patric“? database. The ancient genome is indicated by the black arrow. In both binary distances were calculated prior to create the PCoA. 


Table 3 Identified virulence factors which contained genes not found in K-12 MG1655. 


Gene family Genes 


tss 13 
elf 4 


Mean coverage 
38.21[32.31, 44.10] x 


31.08[28.27, 33.89] x 
36.02[22.49, 49.54] x 


hcp 
cfa 23.01[5.01, 41.02] x 


vas 
vgr 


51.62[49.80, 53.44] x 
47.53[46.14, 48.91] x 


Function Pathovar 


Type VI Secretion System? EAEC? 

Fimbriae23 EAEC, EHEC, ExPEC23 
Type VI Secretion System23 EAEC, EHEC, ExPEC?:23 
Fimbriae?? EAEC, EHEC, ExPEC?3 
Type VI Secretion System23 EAEC23 


Type VI Secretion System23 EAEC23 


EAEC enteroaggregative E. coli, EHEC enterohemorrhagic E. coli, ExPEC extraintestinal pathogenic E.coli. 


Only families which had more than one distinct gene are present. 


Gene Identifier?® and found 47 distinct genes in the scaffold 
assembly of our ancient E. coli genome. Of these, 35 genes were 
also identified in the global pan-genome P/A analysis. Eight genes 
contained duplicates with mdtB being the most commonly 
detected. The remaining subset spanned thirteen antimicrobial 
drug resistance classes with nine classes represented more than 
once (Table 4). Resistance cassettes targeting the five most 
common drug classes were detected in the ancient E. coli strain. 
The majority of these genes are multidrug efflux pumps, which is 
typical of E. coli3°. There were no unexpected AMR genes present 
in our ancient genome with all 35 being found in E. coli K-12 
MG1655. 


Discussion 

The DNA isolated from the stone showed clear evidence of DNA 
damage. Specifically, the deamination plots for the ancient reads 
mapping to the E. coli pan-genome reference contained char- 
acteristic aDNA profiles which indicated increased deamination 
rates at or near the terminal bases (Fig. 1c). In addition depur- 
ination kinetics plots of the E. coli mapped reads from libraries 
with sufficient E. coli read depth (digests 2, 3-4, 5-6) showed that 
fragments from internal sections of the gallstone (better protected 
from hydrolysis) were on average longer (34 bp median in digest 
2 to 37 bp median in digests 5-6) (Fig. 1d). We obtained similar 
results for the K. aerogenes data across digests two through six. 
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Table 4 AMR profile of the ancient E. coli. 


With efflux No efflux 


fluoroquinolone antibiotic 17 
penam 13 
tetracycline antibiotic 10 
cephalosporin 6 
peptide antibiotic 
phenicol antibiotic 
cephamycin 
glycylcycline 
rifamycin antibiotic 
triclosan 
carbapenem 
penem 
monobactam 


Resistances 


Fa a ee ee Oe es ee ny 


-NNAREA BRU 


Antibiotic results from an RGI2? search of the E. coli scaffolds. A total of 29 genes were 
identified, with genes appearing multiple times. Only drug classes were detected at least once 
without the use of antibiotic effluxes. A single gene can be present in multiple drug classes. 


Interestingly, the mapped human reads lack these signature fea- 
tures (Supplemental Fig. 3). More importantly, however, is that 
they confirm that the E. coli and K. aerogenes sequences are 
indeed ancient and were not the result of modern contamination. 

Two potential explanations exist for the lack of deamination 
signal in the mapped human reads. The first is that the sample 
likely contains a mixture of both contaminating and to a lesser 
degree authentic ancient human DNA. This is almost certainly 
the case for the first digest, as it represents the outer layer of the 
gallstone, which would have been subjected to contamination 
from handling, storage, sampling, etc. However, as we moved to 
deeper layers of the gallstone, presumably better protected from 
contamination, the lack of endogenous human DNA was sur- 
prising. This is especially striking with the edit distances for both 
human and E. coli reads from across the different digests (Sup- 
plemental Fig. 12). The more likely explanation is that there are 
simply not enough human DNA reads for meaningful damage 
analysis. Gallstones are typically formed from a combination of 
cholesterol, bile salts, and phosphatidylcholine!>. Furthermore, in 
the highly acidic environment of the gallbladder, only bacteria 
that are specialized in surviving these conditions can thrive and 
potentially form a stone!>. Since gallstones are not formed from 
direct human cellular components its interior is likely virtually 
devoid of human DNA, which is very different from the results 
we obtained from the DNA of ancient abscesses*!. 

The gene depth results of the ancient E. coli genome confirm the 
lack of modern contaminants as they are part of a unimodal normal 
distribution. The sole exception to these coverage stats are 83 genes 
with substantially larger mean depths (Supplemental Data 2). These 
genes consist of primarily hypothetical proteins (24.10%), inter- 
grases and transposons (19.28%), and the T6SS (13.25%). They also 
contain a significantly (P=5.03x10—!!) greater mean hetero- 
zygosity (1.23 x 10~2[9.25 x 10~3, 1.53 x 10~2]) than the rest of the 
ancient genome. These high copy number genes, however, are 
mainly accessory genes (96.34%) which suggests—in combination 
with the greater heterozygosity and presence of transposons—that 
they were located in mobile elements shared between multiple 
bacterial species. 

The plasmids we detected in the scaffold assembly, 
CP019906.1 and CP012732.1, had lower gene depth cov- 
erages than that for our reconstructed chromosome (Table 1). 
Most plasmids are typically present in greater copy numbers than 
chromosomes*”, however, this was not the case as the ancient 
chromosome had a significant (P= 0 using Tukey’s method to 
compare coverage estimates between the plasmids and ancient 


chromosome) deeper mean coverage than CP019906.1 and 
CP012732.1. One possibility is that these two plasmids are 
actually not present, but rather the genes to which the ancient 
reads mapped were actually located in the ancient chromosome. 
This is supported by the fact that the mean gene depths for the 
plasmids fall firmly within the coverage distribution of the 
chromosome (Supplemental Table 2). This could explain large 
regions of the plasmids that lack any mapped reads, as well as the 
lack of reads mapping to intergenic regions (see Fig. 2c and 
Supplementary Fig. 5B). 

The reads mapping to the K. aeorgenes genome could represent 
a similar situation to the plasmid data. Interestingly, the 38 Kbp 
region in the K. aerogenes chromosome which had a read depth 
similar yet significantly different (P = 0 using Tukey’s method to 
compare coverage estimates between the K. aerogenes T6SS and 
the ancient E. coli chromosome) from the plasmids and E. coli 
chromosome (20.36 + 13.56 x) contains the Type VI Secretion 
System (T6SS), which is ubiquitous across Gram-negative bac- 
teria and plays an important role in antagonistic interactions”. 
Given that T6SS is widespread among bacteria, its presence is not 
an exclusive marker for K. aerogenes. However, competitive read 
mapping between E. coli and K. aerogenes does suggest that this 
T6SS does indeed belong to K. aerogenes rather than the homo- 
logous system in E. coli suggesting a recent horizontal transfer, 
which is supported by a comparison of their relative GC 
contents? (Supplemental Fig. 13). 

The structure of our global phylogeny resembles those pre- 
viously published** and places the ancient genome in phylogroup 
A with meaningful support (100% bootstrap for a clade of three 
strains). Interestingly, other historical (~1800s) E. coli strains 
have also been placed in this phylogroup*°. Strains that have been 
isolated from humans living in less industrialized and more rural 
communities typically harbor strains falling within phylogroup 
A8. The late Medieval date of our sample, along with its phylo- 
genetic placement helps confirm its authenticity. In addition, a 
study of E. coli strains isolated from bile infections showed that 
they belong mainly to the phylogenetic A group*® 

We identified a temporal signal in the reduced phylogeny 
(Fig. 3c) which provides additional context for the evolutionary 
history of ST4995 strains. The majority of the sequence type can 
trace their lineage to a strain from approximately 1592[1479, 1690] 
CE, however, the divergence of our ancient genome and two others 
- ESC_AA9618AA AS and ESC_VA4573AA _AS ~ has a deeper 
tMRCA ca. 956[689, 1172] CE. Given the disparity in ages between 
the two clades, it is likely that ST4995 is actually composed of at 
least two subgroups. Evidence supporting this can be seen in an 
MDS of the P/A analysis of all existing ST4995 genomes (Supple- 
mentary Fig. 9). 

The P/A analysis of the accessory genes recapitulates the results 
from the ML phylogenies (Fig. 4a). The accessory PCoA confirms 
that the ancient strain is a member of phylogenetic group A while 
a subsequent P/A analysis with an ST4995-only pan-genome 
supports the proposition that the genome is a member of ST4995 
(Supplemental Fig. 9). Phylogenetic group A—the group from 
which ST4995 stems—is a well-described clade and is the most 
commonly isolated type from human commensals’. E. coli strain 
diversity is driven primarily by socioeconomic factors as indivi- 
duals living in modern industrialized countries are more likely to 
carry B2 strains while those in less industrialized and rural 
communities harbor mostly A®. Given the time period NASD1 
lived, our ancient E. coli is a member of phylogenetic group A. 

No acquired AMR genes were detected in the ancient genome 
confirming the age of the strain. Post removal of the multidrug 
efflux pumps from our genome, no other AMR genes were found 
when compared to E. coli K-12 MG1655*°. The latter is extremely 
pertinent as strong resistance to a particular drug class requires 
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several mechanisms*’. Thus, it is likely that only mild resistances 
to these antimicrobial compounds were needed>”*8. 

The identified ancient genome likely exhibited commensal 
characteristics as it shares most of its virulence factors with E. coli 
K-12 MG1655. The ancient strain lacks the canonical virulence 
factors for STEC and ETEC while also missing the components of 
the type III secretion system used in EHEC. Components of 
EAEC—astA and the T6SS°—were detected, yet other key viru- 
lence factors were missing and astA was also detected in K-12 
MG1655. Results from the P/A analysis of the accessory genes 
with pathovar labels (Fig. 4c) and the virulence genes (Fig. 4d) 
present a similar picture. The ancient genome is positioned in a 
relatively general cluster containing a variety of different patho- 
vars which include EHEC, ETEC, ExPEC, and STEC. The mouse 
sepsis model further reinforces that the ancient genome was an 
opportunistic pathogen as the proxy strain—ATCC 11229—did 
not infect any mice. This information, in combination with the 
identified virulence genes and P/A clustering of the ancient E. coli 
suggests that the ancient genome is an opportunistic pathogen 
that acquired a K. aerogenes T6SS gene cassette during expansion 
in the gallbladder, consistent with previous work showing Kleb- 
siella sp often associated with E. coli in contemporary bile 
infections*®. 


Methods 

Sample, DNA extraction, library preparation, and sequencing. We performed 
DNA extractions on a single gallstone—that had been isolated from the mum- 
mified remains of NASD1—in the clean-room facilities of the McMaster Ancient 
DNA Centre (McMaster University, Hamilton, Ontario, Canada) using meth- 
odologies specific to ancient DNA. We hypothesized that the successive miner- 
alized layers of the stone would contain increasingly better-preserved DNA and 
thus subjected the gallstone to six successive rounds of demineralization and 
digestion*®. Subsequently, DNA was extracted from 250 uL of the supernatant 
stemming from each round, which was then purified and prepared into dual- 
indexed libraries, size-selected to remove artifacts, and sequenced on the Illumina 
HiSeq 1500 platform with 90 bp paired-end reads*!#*. Three additional tissue types 
from the same individual (bladder, small intestine, and lung) were processed and 
sequenced as above for comparative purposes. Full details of the methods can be 
found in the Supplementary Methods section in the Supplementary Materials. 


Bioinformatics. Adapters were identified through AdapterRemoval*® with 
trimming and merging performed by leeHom* using its aDNA setting. 
Sequencing lanes were pooled when applicable and Kraken 2!° determined the 
overall metagenomic composition of the samples. This was done to characterize the 
metagenome and identify taxa of interest. A standard Kraken 2!° database with 
some modifications was used—a kmer size of 25 bp and no minimizer—to account 
for the smaller read lengths of aDNA fragments. Reads underwent string dedu- 
plication using prinseq* prior to the metagenomic analysis. 

Samples were then mapped against the human genome GRCh38*° with BWA‘ 
with a maximum edit distance of 0.01 (-n 0.01), at most two gap openings 
(-o 2), and seeding effectively disabled (-1 16500)*8. A minimum length of 
30 bp and mapping quality of 30 was required for a read to be successfully 
matched. Fragments that did not successfully map were then compared with an E. 
coli pan-genome using the same settings. The pan-genome was created with E. coli 
genomes from PATRIC*? and four Shigella genomes from NCBI. Specifically, 
genomes which were found in humans, chickens, cows, dogs, mice, or pigs and 
were determined to have a “good” genome quality (as defined by PATRIC) were 
selected resulting in 451 strains. These sequences were annotated with Prokka™? 
using the proteins from E. coli K-12 MG1655 (NC_000913.3) obtained from 
NCBI as trusted sequences. The resulting annotations were compiled by Roary*! 
to create the pan-genome. Paralogs were not split and a minimum blastP 
identity of 90% were used as additional settings. Unsuccessfully mapped human 
reads were also compared with K. aerogenes (NC_015663.1) as it was 
substantially— proportional abundance >1%—present in one of the digests while 
also being excluded from the blanks. 

Successfully mapped human, E. coli, and K. aerogenes reads were deduplicated 
based on their 5’ and 3’ coordinates with bam-rmdup (https://bitbucket.org/ 
ustenzel/biohazard-tools/src/master/). The deduplicated samples were pooled 
together based on their digest of origin (1, 2, 3-4, or 5-6). MapDamage 2 . 0° was 
used to estimate the amount of deamination across the mapped reads. Fragment 
length distributions (FLD) and mapping mismatches were also extracted. 
Heterozygosity was tested by calling single nucleotide polymorphisms (SNPs) using 
bcftools with ploidy set to two’ and a quality threshold of 100. 

Following authentication, an assembly was created with the complete set of 
mapped E. coli and K. aerogenes reads (i.e., not deduplicated) using SPAdes 


3.14.1 in isolate mode with customized kmer lengths—9, 19, and 29. The 
resulting assembly was an incomplete and highly fragmented scaffold. This was 
mostly due to the short median fragment lengths (35 bp) of the input reads and the 
required kmer sizes. Nevertheless, the N50 of the assembly was 939 bp with a total 
length of 3, 917, 339 bp which is substantially shorter than the complete E. coli 
chromosomes in PATRIC (5.00[4.96, 5.05]Mbp). MOB-suite was used to 
identify potential plasmids in the assembly, however, the final identification was 
done by mapping to the reference plasmids!’. 


Phylogenetic analysis. We created an SNP alignment using the E. coli genomes 
from PATRIC and the Shigella genomes using Snippy (https://github.com/ 
tseemann/snippy) with E. coli EC42405 (CP043414.1) as the reference strain. 
Non-canonical base pairs were replaced with an N and recombinant regions were 
removed with gubbins*®. As the ancient genome contained several gaps the 
“filter_percentage” flag was increased to 32% to ensure that the assembly was not 
excluded. Once recombination sites were identified and removed a maximum 
likelihood phylogeny was created using IQ-TREE 2°°. ModelFinder was used 
to select the appropriate phylogenetic model with ascertainment bias®” and the tree 
was bootstrapped 1000 times. Redundant genomes were removed from the phy- 
logeny using Treemmer making sure to retain 95% of the diversity—measured by 
Root-to-Tip length -, which returned 107 genomes®®. A new SNP alignment and 
phylogeny was then created using these pruned genomes. The modern strains were 
labeled based on known E. coli phylogenetic groups using ClermonTyping™®. 
Four E. coli strains which were identified as part of A were relabeled as potential 
Shigella strains. This is due to the presence of an ipaH gene in two of the genomes 
and their positioning in the phylogeny. 

Upon determining the phylogroup of the ancient genome, the subgroup AO E. 
coli genomes were used to carefully position the ancient strain within the same 
clade. This subgroup is defined by the Clermont genotype (+ - - -)°!. An additional 
seven enterotoxigenic E. coli strains obtained from von Mentzer et al. 2014 were 
also included in the phylogeny®*. This phylogeny was created using the same 
process noted previously and was rooted using IAI] Bl (NC_011741.1). The 
phylogeny of the ST4995 strains was created with the 22 ST4995 genomes available 
in Enterobase!® and rooted using the $T325 strain CFSAN051544 
(SAMNO5414627). To determine if the data contained any temporal signature, 
sampling dates or sequencing dates were assigned to the pruned phylogeny and a 
Root-to-tip (RIT) regression was performed for the global, AO subgroup, and 
ST4995 phylogenies. The presence of a temporal signal was then assessed via a 
date-randomization test and with TEMPest!?. A molecular clock was fit to the data 
using a least-squares dating approach using LSD2°,64, 

The dated ST4995 phylogeny was created by dating the bootstrap trees with 
LSD2 (using -r 1 n-s 1500000 as settings)®>. These trees were then 
summarized using treeannotator® with the height of the nodes set by the 
mean of the bootstrap distribution. The resulting phylogeny was then used to plot 
the dated phylogeny with the node height confidence intervals included. 


Gene function analysis. To determine what genes were present in the ancient 
strain, we converted the read depths of the ancient E. coli pan-genome into a gene 
presence/absence (P/A) metric. We then used a conservative approach, identifying 
a gene as present if it had an average sequence depth of at least 10 x with a 
coefficient of variation (CV) <1. A CV filter was used to remove genes, which 
contained substantial regions of stacked reads. In comparison to a percent coverage 
threshold, a CV is more permissive with unmapped areas if the rest of the gene 
coverage is relatively consistent. An additional threshold of x + 2 * s was used to 
identify genes with a high copy number. While still part of the ancient genome 
these genes could potentially belong to plasmids*? or represent gene 
amplification®. 

The P/A matrix for modern strains was generated by Roary®!. A 95% core 
genome of the modern strains was identified and genes missing from our ancient 
strain were submitted to STRING® to be functionally annotated using a network 
consisting of “high confidence” interactions. Virulence factors were identified using 
a curated set of known virulence genes’, Ambiguously named genes (i.e., 
group_1000) were run through blast X™ against the RefSeq non-redundant 
protein database from 2019-11-29 with a maximum E-value of 10~° to determine 
candidates. If multiple hits were found for a gene the smallest E-value was selected. 
Ties were resolved by choosing the gene with the highest bitscore followed by the 
highest percent identity. A minimum percent identity of 90% was required for a 
match to be kept and the ambiguous gene renamed. The descriptions of the 
previously ambiguous genes were searched for the keywords “secretion, 
invasion, and enterotoxin”. Genes with hits from this search were included 
with the virulence genes. To determine the pathovar of the ancient genome, 
relevant information was extracted from the PATRIC metadata. 

We performed a P/A analysis on the core, accessory genomes, and the virulence 
genes. The strains were transformed into a binary distance matrix and clustered 
using a PCoA. To ensure that redundant genes were not included, two exclusion 
criteria were created: genes that were ubiquitously present were excluded, and 
genes not present in at least five genomes were removed from the accessory 
genome analysis. Of the 451 E. coli genomes, four were removed from the pan- 
genome analysis due to low gene counts (see Supplemental Fig. 14 for gene counts 
and Supplemental Data 1 for removed genomes). The Resistance Gene Identifier 
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(RG)? with default settings identified potential antimicrobial resistances in the 
ancient strain. Only RGI entries returned due to gene homologies and present in 
the pan-genome were accepted. Both the identified virulence and resistance genes 
were compared to the archetypal E. coli strain K-12 MG1655. It is avirulent and 
susceptible to antibiotics”?. 

Ten 14-16 g (4-week-old) female mice OF1 from Charles River” (L’Arbresle, 
France) per strain received a subcutaneous injection of 0.2 ml of bacterial 
suspension in the neck (2 x 108 colony-forming unit). Time to death was recorded 
during the following 7 days. Mice surviving more than 7 days were considered 
cured and sacrificed. The E. coli CFT073 strain was used as a positive control killing 
all the inoculated mice whereas the E. coli K-12 MG1655 strain was used as a 
negative control for which all the inoculated mice survive*®. The protocol 
(n°APAFIS#4948) was approved by the French Ministry of Research and by the 
ethical committee for animal experiments. The ATCC11229 strain (AMC 198) was 
obtained from the Institut Pasteur collection (CIP 103795) and whole-genome 
sequenced using Illumina technology to verify the identity of the strain. This strain 
is of commensal origin and is commonly used in bacterial resistance testing. 


Reporting summary. Further information on research design is available in the Nature 
Research Reporting Summary linked to this article. 


Data availability 

The raw sequencing data has been uploaded to NCBI as PRJNA810725 with supporting 
metadata located in Supplementary Table 1. Access to the sample itself can be arranged 
by contacting G.S.L. 


Code availability 

The scripts and additional data used to analyze the data after pan-genome mapping are 
available at https://github.com/longg2/AncientEcoli”!. The scripts used for the initial 
mapping and pan-genome creation is available at https://github.com/longg2/ 
LongBioinformatics/tree/vAncEcoli”2. 
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